We consider methods for quantifying the similarity of vertices in networks. We propose a measure
of similarity based on the concept that two vertices are similar if their immediate neighbors in the 
network are themselves similar. This leads to a self-consistent matrix formulation of similarity that 
can be evaluated iteratively using only a knowledge of the adjacency matrix of the network. We test 
our similarity measure on computer-generated networks for which the expected results are known, 
and on a number of real-world networks. 

I. INTRODUCTION AND BACKGROUND

The study of networked systems, including computer
networks, social networks, biological networks, and others,
has attracted considerable attention in the recent
physics literature. A number of structural properties
of networks have been the subject of particularly
intense scrutiny, including the lengths of paths between
vertices, degree distributions, community structure,
and various measures of vertex centrality.

Another important network concept that has received 
comparatively little attention is vertex similarity. There 
are many situations in which it would be useful to be able 
to answer questions such as "How similar are these two 
vertices?" or "Which other vertices are most similar to 
this vertex?" Of course, there are many senses in which 
two vertices can be similar. In the network of the World 
Wide Web, for instance, in which vertices represent Web 
pages, two pages might be considered similar if the text 
appearing on them contains many of the same words. 
In a social network representing friendship between indi- 
viduals, two people might be considered similar if they 
have similar professions, interests, or backgrounds. In 
this paper we consider ways of determining vertex simi- 
larity based solely on the structure of a network. Given 
only the pattern of edges between vertices in a network, 
we ask, can we define useful measures that tell us when 
two vertices are similar? Similarity of this type is sometimes
called structural similarity, to distinguish it from
social similarity, textual similarity, or other similarity types.
It is a basic premise of research on networks that the 
structure of a network reflects real information about the 
vertices the network connects, so it appears reasonable 
that meaningful structural similarity measures might ex- 
ist. Here we show that indeed they do and that they can 
return useful information about networks. 

The problem of quantifying similarity between vertices 
in a network is not a new one. The most common ap- 
proach in previous work has been to focus on so-called 
structural equivalence. Two vertices are considered struc- 
turally equivalent if they share many of the same network 
neighbors. For instance, it seems reasonable to conclude 
that two individuals in a social network have something
in common if they share many of the same friends. Let $\Gamma_i$
be the neighborhood of vertex i in a network, i.e., the set
of vertices that are directly connected to i via an edge.
Then the number of common friends of i and j is

$$ \sigma_{\rm unnorm} = |\Gamma_i \cap \Gamma_j|, \tag{1} $$

where $|x|$ indicates the cardinality of (i.e., the number of elements
in) the set x, so that, for instance, $|\Gamma_i|$ is simply
equal to the degree of vertex i.

The quantity $\sigma_{\rm unnorm}$ can be regarded as a rudimentary
measure of similarity between i and j. It is, however, not
entirely satisfactory. It can take large values for vertices
with high degree even if only a small fraction of their
neighbors are the same, and in many cases this runs contrary
to our intuition about what constitutes similarity.
Commonly, therefore, one normalizes in some way, for instance
so that the similarity is one when $\Gamma_i = \Gamma_j$. We
are aware of at least three previously proposed ways of
doing this:

The first of these, commonly called the Jaccard index,
was proposed by Jaccard over a hundred years ago [18];
the second, called the cosine similarity, was proposed
by Salton in 1983 and has a long history of study in
the literature on citation networks. (Measures
nonlinear in $\sigma_{\rm unnorm}$ are also possible. For example,
Refs. [23] and [24] propose measures involving $\sqrt{\sigma_{\rm unnorm}}$
and $\sigma_{\rm unnorm}^2$, respectively.)
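
The neighborhood-based measures discussed above can be illustrated with a short sketch. The displayed definitions of the normalized measures are not reproduced in this copy, so the code below assumes the standard forms of the two measures named in the text: the Jaccard index $|\Gamma_i \cap \Gamma_j| / |\Gamma_i \cup \Gamma_j|$ and the cosine similarity $|\Gamma_i \cap \Gamma_j| / \sqrt{|\Gamma_i||\Gamma_j|}$. The function name and the toy graph are hypothetical.

```python
import numpy as np

def neighborhood_similarities(A, i, j):
    """Return (common-neighbor count, Jaccard index, cosine similarity) for i and j.

    Assumes the standard definitions of the Jaccard and cosine measures.
    """
    gamma_i = set(np.flatnonzero(A[i]))   # neighborhood of i
    gamma_j = set(np.flatnonzero(A[j]))   # neighborhood of j
    common = len(gamma_i & gamma_j)       # sigma_unnorm of Eq. (1)
    union = len(gamma_i | gamma_j)
    jaccard = common / union if union else 0.0
    denom = np.sqrt(len(gamma_i) * len(gamma_j))
    cosine = common / denom if denom else 0.0
    return common, jaccard, cosine

# Hypothetical toy graph with edges 0-2, 0-3, 1-2, 1-3, 1-4.
A = np.zeros((5, 5), dtype=int)
for u, v in [(0, 2), (0, 3), (1, 2), (1, 3), (1, 4)]:
    A[u, v] = A[v, u] = 1

common, jaccard, cosine = neighborhood_similarities(A, 0, 1)
print(common, jaccard, cosine)
```

Vertices 0 and 1 are not themselves connected, yet they share two neighbors, which is exactly the situation these measures are designed to detect.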

There are, however, many cases in which vertices oc- 
cupy similar structural positions in networks without 
having common neighbors. For instance, two store clerks 
in different towns occupy similar social positions by 
virtue of their numerous professional interactions with 
customers, although it is quite likely that they have none 
of those customers in common. Two CEOs of companies 
occupy similar positions by virtue of their contacts with 
other high-ranking officers of the companies for which
they work, although again none of the individual officers
need be common to both companies. Considerations of
this kind lead us to an extended definition of network
similarity known as regular equivalence. In this case vertices
are said to be similar if they are connected to other
vertices that are themselves similar. It is upon this idea
that the measures developed in this paper are based.

FIG. 1: A vertex j is similar to vertex i (dashed line) if i has
a network neighbor v (solid line) that is itself similar to j.

Regular equivalence is clearly a self-referential con- 
cept: one needs to know the similarity of the neighbors 
of two vertices before one can compute the similarity of 
the two vertices themselves. It comes as no surprise to 
learn, therefore, that traditional algorithms for computing
regular equivalence have an iterative or recursive nature.
Two of the best known such algorithms, REGE and
CATREGE, proceed by searching for optimal matching
between the neighbors of the two vertices, while other
authors have formulated the calculation as an optimization
problem.

In this paper, we take a different approach, construct- 
ing measures of similarity using the methods of linear 
algebra. The fundamental statement of our approach is 
that vertices i and j are similar if either of them has 
a neighbor v that is similar to the other; see Fig. 1.
Coupled with the additional assumption that vertices are 
trivially similar to themselves, this gives, as we will see, 
a sensible and straightforward formulation of the con- 
cept of regular equivalence for undirected networks. The 
method has substantial advantages over other similarity 
measures: it is global — unlike the Jaccard index and re- 
lated measures, it depends on the whole graph and allows 
vertices to be similar without sharing neighbors; it has 
a transparent theoretical rationale, which more complex 
methods like REGE and CATREGE lack; it avoids
the convergence problems that have plagued optimization
methods; and it is comparatively fast, since its implementation
can take advantage of standard, hardware-optimized
linear algebra software.

Some previous authors have also considered similarity
measures based on matrix methods [27, 28]. We discuss
the differences between our measure and these previous
ones in Section II C.

This paper is organized as follows. In Section II we
present the derivation of our structural similarity measure.
In Section III we test the measure on a number
of networks, including computer-generated graphs (Sections
III A and III B) and real-world examples (Sections
III C and III D). In Section IV we give our conclusions.



II. A MEASURE OF SIMILARITY 

The fundamental principle behind our measure of 
structural similarity in networks is that i is similar to 
j if i has a network neighbor v that is itself similar to j 
(Fig. 1). Alternatively, swapping i and j, we could say
that j is similar to i if it has a neighbor v that is similar 
to i. Despite the apparent asymmetry between i and j in 
these statements we will see that they both lead to the 
same similarity measure, which is perfectly symmetric. 

Our definition of similarity is clearly recursive and 
hence we need to provide some starting point for the re- 
cursion in order to make the results converge to a useful 
limit. The starting point we choose is to make each ver- 
tex similar to itself, which is natural in most situations. 
Our definition of similarity will thus have two compo- 
nents: the neighbor term of the previous paragraph and 
the self-similarity. 

Thus our first guess at the form of the similarity (we
will improve it later) is to write the similarity $S_{ij}$ of vertex
i to vertex j as

$$ S_{ij} = \phi \sum_v A_{iv} S_{vj} + \psi \delta_{ij}, \tag{3} $$

where $A_{ij}$ is an element of the adjacency matrix of the
network taking the value

$$ A_{ij} = \begin{cases} 1 & \text{if there is an edge between } i \text{ and } j, \\ 0 & \text{otherwise,} \end{cases} \tag{4} $$

and $\phi$ and $\psi$ are free parameters whose values control the
balance between the two components of the similarity.

Considering $S_{ij}$ to be the ij element of a similarity
matrix S, we can write Eq. (3) in matrix form as

$$ \mathbf{S} = \phi \mathbf{A}\mathbf{S} + \psi \mathbf{I}, \tag{5} $$

where I is the identity matrix. Rearranging, this can also
be written $\mathbf{S} = \psi [\mathbf{I} - \phi\mathbf{A}]^{-1}$. As we see, the parameter $\psi$
merely contributes an overall multiplicative factor to our
similarity. Since in essentially all cases we will be concerned
not with the absolute magnitude of the similarity
but only with the relative similarity of different pairs of
vertices, we can safely set $\psi = 1$, eliminating one of our
free parameters, and giving

$$ \mathbf{S} = [\mathbf{I} - \phi\mathbf{A}]^{-1}. \tag{6} $$

This expression for similarity bears a close relation to the
matrix-based centrality measure of Katz. In fact, the
Katz centrality of a vertex is equal simply to the sum of 
that vertex's similarities to every other vertex. This is a 
natural concept: a vertex is prominent in a network if it 
is closely allied with many other vertices. 
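
As a minimal numerical sketch of Eqs. (5) and (6), the snippet below builds the matrix $[\mathbf{I} - \phi\mathbf{A}]^{-1}$ for a small hypothetical path graph, checks that it satisfies the self-consistency condition with $\psi = 1$, and forms the row sums that correspond to the Katz-style centrality mentioned above. The graph and the value of $\phi$ are illustrative assumptions; note that conventions for Katz centrality differ on whether the zeroth-order term is included.

```python
import numpy as np

# Hypothetical 4-vertex path graph 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = len(A)
I = np.eye(n)

lam1 = np.linalg.eigvalsh(A).max()   # largest eigenvalue of the symmetric matrix A
phi = 0.5 / lam1                     # illustrative choice with phi < 1/lam1

S = np.linalg.inv(I - phi * A)       # Eq. (6)

# Self-consistency check of Eq. (5) with psi = 1: S = phi*A*S + I.
residual = np.abs(S - (phi * A @ S + I)).max()

# Row sums of S: each vertex's total similarity to all vertices.
katz = S.sum(axis=1)
print(residual, katz)
```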

We can also consider the similarity of i and j when j
has a neighbor v that is similar to i. In that case,

$$ S_{ij} = \phi \sum_v S_{iv} A_{vj} + \psi \delta_{ij}. \tag{7} $$

It is trivial to show, however, that this leads to precisely
the same expression, Eq. (6), for the similarity in the end.
Thus our definition provides only one similarity value for
any pair of vertices, given by the symmetric matrix S of
Eq. (6).

The remaining parameter $\phi$ in Eq. (6) is still free. To
shed light on the appropriate value for this parameter,
let us expand the similarity as a power series thus:

$$ \mathbf{S} = \mathbf{I} + \phi\mathbf{A} + \phi^2\mathbf{A}^2 + \ldots \tag{8} $$

Noting that the element $[\mathbf{A}^l]_{ij}$ is equal to the number
of (possibly self-intersecting) network paths of length l
from i to j, this equation gives us an alternative, term-by-term
interpretation of our similarity measure. The
first term says that a vertex is identically similar to itself.
The second term says that vertices that are immediate
neighbors of one another have similarity $\phi$. The third
term says that vertices that are distance two apart on
the network have similarity $\phi^2$. And so forth.
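
The path-counting interpretation of Eq. (8) can be checked numerically. The sketch below, on a hypothetical four-vertex graph, sums the first terms of the series and compares the result with the closed form of Eq. (6); the helper name `series_similarity` and the choice of $\phi$ are assumptions made for illustration.

```python
import numpy as np

def series_similarity(A, phi, n_terms):
    """Partial sum I + phi*A + ... + phi^(n_terms-1) * A^(n_terms-1) of Eq. (8)."""
    n = len(A)
    term = np.eye(n)              # l = 0 term
    total = np.zeros((n, n))
    for _ in range(n_terms):
        total += term
        term = phi * (A @ term)   # next term: phi^(l+1) * A^(l+1)
    return total

# Hypothetical graph: triangle 0-1-2 with a pendant vertex 3 attached to 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

paths2 = A @ A                    # [A^2]_ij counts paths of length two
lam1 = np.linalg.eigvalsh(A).max()
phi = 0.8 / lam1                  # the series converges only for phi < 1/lam1

exact = np.linalg.inv(np.eye(4) - phi * A)
approx = series_similarity(A, phi, 200)
error = np.abs(exact - approx).max()
print(paths2[0, 1], error)
```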

But notice also that vertices that have many paths of a 
given length are considered more similar than those that 
have few. The similarity of vertices i and j acquires a 
contribution $\phi^2$ for every path of length 2 from i to j. We
note however that some pairs of vertices are expected to 
have one or even many such paths between them: vertices 
with very high degree, for instance, will almost certainly 
have one or several paths of length two connecting them, 
even if connections between vertices are just made at 
random. So simple counts of number of paths are not 
enough to establish similarity. We need to know when a 
pair of vertices has more paths of a given length between
them than we would expect by chance.

This suggests a strategy for choosing $\phi$. We will normalize
each term in our series by dividing the number of
paths of length I (given by the power of the adjacency 
matrix) by the expected number of such paths, were ver- 
tices in the network connected at random. Then each 
term will be greater or less than unity by a factor rep- 
resenting the extent to which the corresponding vertices 
have more or fewer paths of the appropriate length than 
would be expected by chance. In fact, there is no single
choice of the parameter $\phi$ that will simultaneously
achieve this normalization for every term in the series
but, as we will show, there is a choice that achieves it
approximately for every term, and exactly in the asymptotic
limit of high terms in the series, if we allow a slight
(and with hindsight sensible) modification of Eq. (6).

A. Expected number of paths 

Let us generalize the series, Eq. (8), to allow an independent
coefficient for each term and for each vertex
pair:

$$ S_{ij} = \sum_{l=0}^{\infty} C_l^{ij} [\mathbf{A}^l]_{ij}. \tag{9} $$



And let us choose (for the moment) each coefficient to 
be equal to 1 over the expected number of paths of the 
corresponding length between the same pair of vertices 
on a network with the same degree sequence as the net- 
work under consideration, but in which the vertices are 
otherwise randomly connected. Such a network is called 
a configuration model, and the configuration model has 
been widely studied in the networks literature [30, 31, 32].

The zeroth-order coefficient $C_0^{ij}$ is trivial: there are
no paths of length zero between vertices i and j unless
i = j, in which case there is exactly one such path. So
$C_0^{ij} = \delta_{ij}$. The first-order term is more interesting. If
vertices i and j have degrees $k_i$ and $k_j$ respectively, then
we can calculate the expected number of paths of length
one between them as follows. For any of the $k_i$ edges
emerging from vertex i, there are 2m places where it 
could terminate, where m is the total number of edges 
in the network. Of these, kj end at vertex j and hence 
result in a direct path of length one from i to j. Thus for 
each edge emerging from i there is a probability kj /2m of 
a length-one path to j, and overall the expected number 
of such paths is $k_i k_j/2m$. Thus

$$ C_1^{ij} = \frac{2m}{k_i k_j}. \tag{10} $$



Now consider the second-order term in the series. A 
path of length two between i and j must go through 
a single intermediate vertex v, whose degree we denote
$k_v$. Using the argument of the preceding paragraph, the
expected number of paths of length one from i to v is
$k_i k_v/2m$. This uses up one of the edges emerging from
v, leaving $k_v - 1$ remaining edges and thus the expected
number of paths of length one from v to j, given that
there is already a path from i to v, is $(k_v - 1)k_j/2m$.
The expected number of paths of length two from i to j
via v is then the product $k_i k_v (k_v - 1) k_j/(2m)^2$. Summing
over all v, the total expected number of paths of length
two is

$$ \frac{k_i k_j}{(2m)^2} \sum_v k_v (k_v - 1) = \frac{k_i k_j}{2m} \, \frac{\langle k^2 \rangle - \langle k \rangle}{\langle k \rangle}, \tag{11} $$

where $\langle k \rangle$ and $\langle k^2 \rangle$ are the mean degree and mean-square
degree of the network, and we have made use of the result
$2m = n\langle k \rangle$, where n is the total number of vertices in the
network.
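
The two forms of Eq. (11) can be checked against each other for any degree sequence. The following sketch evaluates both the explicit sum over intermediate vertices and the closed form in terms of $\langle k \rangle$ and $\langle k^2 \rangle$; the function names and the degree sequence are hypothetical and chosen only for illustration.

```python
import numpy as np

def expected_paths2_sum(k, i, j):
    """Left-hand side of Eq. (11): explicit sum over intermediate vertices v."""
    k = np.asarray(k, dtype=float)
    two_m = k.sum()                               # 2m = sum of degrees
    return k[i] * k[j] / two_m**2 * np.sum(k * (k - 1))

def expected_paths2_closed(k, i, j):
    """Right-hand side of Eq. (11): closed form using <k> and <k^2>."""
    k = np.asarray(k, dtype=float)
    two_m = k.sum()
    mean_k = k.mean()
    mean_k2 = (k**2).mean()
    return k[i] * k[j] / two_m * (mean_k2 - mean_k) / mean_k

degrees = [3, 2, 2, 2, 1]                         # hypothetical degree sequence, 2m = 10
lhs = expected_paths2_sum(degrees, 0, 1)
rhs = expected_paths2_closed(degrees, 0, 1)
print(lhs, rhs)
```

For this sequence both expressions give 0.72 expected paths of length two between the vertices of degree 3 and degree 2.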

For paths of length three and greater, the calcula- 
tions become more complicated. Since paths can be 
self-intersecting, we have to consider topologies for those 
paths that include loops or that traverse the same edge 
more than once. While there is only one topology for 
paths of length one or two between a specified pair of 
vertices, there are four distinct topologies for paths of
length three (Fig. 2), and we must calculate and sum
the expected numbers of each of them to get the total
expected number of paths. The end result for paths of
length three is

$$ \frac{k_i k_j}{2m} \left[ \left( \frac{\langle k^2 \rangle - \langle k \rangle}{\langle k \rangle} \right)^2 + k_i + k_j - 1 \right]. \tag{12} $$

FIG. 2: There is only one possible topology for paths of
length one between distinct vertices, and only one for paths
of length two, but there are four possible topologies for paths
of length three.



As a check on our calculations, we compare our ana- 
lytic expressions for the numbers of paths of length 2 and 
3 to actual path counts for randomly generated networks 
in Fig. 3. There is increased scatter in the numerical data
at longer path lengths because the graphs studied are fi- 
nite in size, but overall the agreement between analytic 
and numerical calculations is good. 

While this is rewarding, it is not possible to extend this 
line of investigation much further. The expressions for 
expected numbers of paths become more complicated as 
path length increases and the numbers of distinct topolo- 
gies multiply. So instead, we take a slightly different ap- 
proach. 

The expected number of paths of length l from i to j
can be written as the jth element of the vector $\mathbf{p}_l$ given
by

$$ \mathbf{p}_l = \mathbf{A}^l \mathbf{v}, \tag{13} $$

where the vector v has all elements zero except for $v_i = 1$.
In the limit of large l, the vector $\mathbf{p}_l$ tends toward
(a multiple of) the leading eigenvector of the adjacency
matrix, and hence in this limit we have $\mathbf{p}_{l+1} = \lambda_1 \mathbf{p}_l$,
where $\lambda_1$ is the largest eigenvalue of A. Thus the number
of paths from i to j increases by a factor of $\lambda_1$ each time
we add one extra step to the path length. The first step
of the path violates this rule: we know the number of
paths increases by exactly a factor of $k_i$ on the first step.
Furthermore, since our paths are constrained to end at
vertex j, the last step must end at one of the $k_j$ edges
emanating from j, out of a total of 2m possible places
that it could end. This introduces a factor of $k_j/2m$
into the expected number of paths. Thus, to within a
multiplicative constant, the number of paths of length l
from i to j, for large l, should be $(k_i k_j/2m)\lambda_1^{l-1}$.
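
The claim that path counts grow by a factor of $\lambda_1$ per step at large l can be illustrated by repeatedly applying the adjacency matrix to the indicator vector of Eq. (13). This is a minimal sketch on a hypothetical connected, non-bipartite graph (on a bipartite graph the ratio of successive counts can oscillate rather than settle).

```python
import numpy as np

# Hypothetical graph: triangle 0-1-2 with a pendant vertex 3 attached to 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

lam1 = np.linalg.eigvalsh(A).max()   # largest eigenvalue of A

p = np.zeros(4)
p[0] = 1.0                           # v of Eq. (13) for starting vertex i = 0

ratios = []
for _ in range(60):
    p_next = A @ p                   # p_{l+1} = A p_l
    ratios.append(p_next.sum() / p.sum())   # growth in the total number of paths
    p = p_next

print(ratios[0], ratios[-1], lam1)
```

The first ratio equals the degree of the starting vertex, consistent with the remark that the first step multiplies the number of paths by exactly $k_i$; later ratios approach $\lambda_1$.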

This expression is not in general correct for small l.
It is however correct for the particular case l = 1 of
paths of length one (see Eq. (10)) and we expect it to
be approximately correct for other intermediate values of
l > 1. Guided by these results, we therefore choose the
constants $C_l^{ij}$ appearing in Eq. (9) to take the values

$$ C_l^{ij} = \frac{2m}{k_i k_j} \lambda_1^{-l+1} \tag{14} $$

for $l \geq 1$, with $C_0^{ij} = \delta_{ij}$. These values approximate the
desired values based on expected numbers of paths and
are asymptotically correct in the limit of large l.

FIG. 3: (a) The actual number of paths of length two between
vertex pairs in a configuration model versus the expected
number of paths, as determined from Eq. (11). (b) Same
as in plot (a), but for paths of length three.



B. Derivation of the similarity 

There is one more issue we need to deal with before
we arrive at a final expression for our similarity. If
we simply substitute $C_l^{ij}$ from Eq. (14) into Eq. (9) we
produce a series that unfortunately does not converge.
Thus, to ensure convergence, we introduce an extra numerical
factor $\alpha$, giving a series thus:

$$ S_{ij} = \delta_{ij} + \frac{2m}{k_i k_j} \sum_{l=1}^{\infty} \alpha^l \lambda_1^{-l+1} [\mathbf{A}^l]_{ij} = \left[ 1 - \frac{2m\lambda_1}{k_i k_j} \right] \delta_{ij} + \frac{2m\lambda_1}{k_i k_j} \left[ \left( \mathbf{I} - \frac{\alpha}{\lambda_1} \mathbf{A} \right)^{-1} \right]_{ij}. \tag{15} $$

In physical terms, the effect of the parameter $\alpha$ is to
reduce the contribution of long paths relative to short
ones. That is, for $0 < \alpha < 1$, our similarity measure
considers vertices to be more similar if they have a greater
than expected number of short paths between them than
if they have a greater than expected number of long ones.
While this is a natural route to take, it does mean we have
introduced a new free parameter into our calculations.
This seems a fair exchange: we have traded the infinite
number of free parameters in the expansion of Eq. (9) for
just a single such parameter. We discuss the appropriate
choice of value for $\alpha$ in Section III B.

The first term in Eq. (15) is diagonal in the vertices
i and j and hence affects only the similarity of vertices
to themselves, which we are not usually interested in, so
we will henceforth drop it. Thus, our final expression for
the similarity is

$$ S_{ij} = \frac{2m\lambda_1}{k_i k_j} \left[ \left( \mathbf{I} - \frac{\alpha}{\lambda_1} \mathbf{A} \right)^{-1} \right]_{ij}. \tag{16} $$

Equivalently, we could write this in matrix form thus:

$$ \mathbf{S} = 2m\lambda_1 \mathbf{D}^{-1} \left( \mathbf{I} - \frac{\alpha}{\lambda_1} \mathbf{A} \right)^{-1} \mathbf{D}^{-1}, \tag{17} $$

where D is the diagonal matrix having the degrees of the
vertices in its diagonal elements: $D_{ij} = k_i \delta_{ij}$.

This similarity measure takes exactly the form we postulated
in Eq. (6) with $\phi = \alpha/\lambda_1$, except for an overall
multiplier, which is trivial, and the leading factor of
$1/k_i k_j$, which is not. This factor compensates for the fact
that we expect there to be more paths between pairs of
vertices with high degree simply because there are more
ways of entering and leaving such vertices. Its presence
is crucial if we wish to compare the similarities of vertex
pairs having very different degrees.

In practical terms, the calculation of the similarity
matrix is most simply achieved by direct multiplication.
Dropping the constant factor $2m\lambda_1$ for convenience, we
can rewrite Eq. (17) in the form of Eq. (5) thus:

$$ \mathbf{D}\mathbf{S}\mathbf{D} = \frac{\alpha}{\lambda_1} \mathbf{A} (\mathbf{D}\mathbf{S}\mathbf{D}) + \mathbf{I}. \tag{18} $$

Making any guess we like for an initial value of DSD,
such as DSD = 0, we iterate this equation repeatedly
until it converges. In practice, for the networks studied
here, we have found good convergence after 100 iterations
or fewer.



C. Comparison with previous similarity measures

Several other authors have proposed vertex similarity
measures based on matrix methods similar to ours [27, 28].

Jeh and Widom [28] have proposed a method that they
call "SimRank," predicated, as ours is, on the idea that
vertices are similar if their neighbors are similar. In our
notation, their measure is

$$ S_{ij} = \frac{C}{k_i k_j} \sum_{u,v} A_{iu} A_{vj} S_{uv}, \tag{19} $$

where C is a constant. While this expression bears some
similarity to ours, Eq. (3), it has an important difference
also. Starting from an initial guess for $S_{ij}$, one can iterate
to converge on a complete expression for the similarity,
and this final expression contains terms representing
path counts between vertex pairs, as in our case. However,
since the adjacency matrix appears twice on the
right-hand side of Eq. (19), the expression includes only
paths of even length. This can make a substantial difference
to the resulting figures for similarity. An extreme
example would be a bipartite network, such as a tree or a
square lattice, in which vertices are separated either only
by paths of even length or only by paths of odd length.
In such cases, those vertices that are separated only by 
paths of odd length will have similarity zero. Even ver- 
tices that are directly connected to one another by an 
edge will have similarity zero. Most people would con- 
sider this result counterintuitive, and our measure, which 
counts paths of all lengths, seems clearly preferable. 

Blondel et al. [27] considered similarity measures for
directed networks, i.e., based on asymmetric adjacency
matrices, which is a more complex situation than the one
we consider. However, for the special case of a symmetric
matrix, the measure of Blondel et al. can be written as

$$ S_{ij} = C \sum_{u,v} A_{iu} A_{vj} S_{uv}, \tag{20} $$

where C is again a constant. This is very similar to the
measure of Jeh and Widom, differing only in the omission
of the factor $1/k_i k_j$. Like the measure of Jeh and Widom,
it can be written in terms of paths between vertices, but 
counts only paths of even lengths, so that again vertices 
separated only by odd paths (such as adjacent vertices 
on bipartite graphs) have similarity zero. 
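
The even-path limitation described above is easy to reproduce on a small bipartite graph. The sketch below iterates the update on the right-hand side of Eq. (20) starting from the identity matrix, then compares the result with the all-lengths measure of Eq. (17). The four-vertex cycle, the rescaling by the matrix norm at each step, the number of iterations, and the value $\alpha = 0.9$ are all illustrative assumptions.

```python
import numpy as np

# Hypothetical bipartite example: the 4-cycle 0-1-2-3-0.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
n = len(A)

# Even-path measure: repeated application of S <- A S A, as in Eq. (20).
S_even = np.eye(n)
for _ in range(10):
    S_even = A @ S_even @ A
    S_even /= np.linalg.norm(S_even)   # rescale; only relative values matter

# All-lengths measure of Eq. (17), with the constant factor 2*m*lambda_1 dropped.
lam1 = np.linalg.eigvalsh(A).max()
alpha = 0.9
k = A.sum(axis=1)
S_all = np.linalg.inv(np.eye(n) - (alpha / lam1) * A) / np.outer(k, k)

print(S_even[0, 1], S_even[0, 2])      # adjacent pair vs. pair at distance two
print(S_all[0, 1], S_all[0, 2])
```

The adjacent pair (0, 1) receives exactly zero similarity from the even-path update, while the measure that counts paths of all lengths assigns it a positive value.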



D. A measure of structural equivalence 

An interesting corollary of the theory developed in
the previous sections is an alternative measure of structural
equivalence. The structural equivalence measures of
Eq. (2) can be viewed as similarity measures that count
only the paths of length two between vertex pairs; the
number of common neighbors of two vertices is exactly
equal to the number of paths of length two. Thus structural
equivalence can be thought of as just one term (the
second-order term) in the infinite series that defines our
measure of regular equivalence.

The measures of Eq. (2) differ from one another in their
normalization. The developments outlined in this paper
suggest another possible normalization, one in which we
divide the number of paths of length two by its expected
value, Eq. (11), giving

$$ S_{ij} = \frac{2m}{k_i k_j} \, \frac{\langle k \rangle}{\langle k^2 \rangle - \langle k \rangle} \, |\Gamma_i \cap \Gamma_j|. \tag{21} $$

If we are concerned only with the comparative similarities
of different pairs of vertices within a given graph, then
we can neglect multiplicative constants and write

$$ S_{ij} = \frac{|\Gamma_i \cap \Gamma_j|}{k_i k_j} = \frac{|\Gamma_i \cap \Gamma_j|}{|\Gamma_i| \, |\Gamma_j|}. \tag{22} $$

This is, we feel, in many ways a more sensible measure
of structural equivalence than those of Eq. (2). It gives
high similarity to vertex pairs that have many common 
neighbors compared not to the maximum number possi- 
ble but to the expected number of such neighbors, and 
therefore highlights vertices that have a statistically im- 
probable coincidence of neighborhoods. Of course, one 
could define similar measures for paths of length 1 or 3 or 
any other length. Or one could combine all such lengths, 
which is precisely what our overall similarity measure 
does. 
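
The normalization of Eq. (22) is simple to compute for all vertex pairs at once, since $[\mathbf{A}^2]_{ij}$ is the number of common neighbors of i and j. The sketch below is a minimal illustration on a hypothetical toy graph; the function name is an assumption, as is the convention of assigning similarity zero to pairs involving a vertex of degree zero.

```python
import numpy as np

def common_neighbor_similarity(A):
    """Common neighbors divided by the product of degrees, Eq. (22), for all pairs."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                    # degrees k_i = |Gamma_i|
    common = A @ A                       # [A^2]_ij = |Gamma_i intersect Gamma_j|
    with np.errstate(divide="ignore", invalid="ignore"):
        S = common / np.outer(k, k)
    S[~np.isfinite(S)] = 0.0             # assumption: isolated vertices get similarity 0
    return S

# Hypothetical toy graph with edges 0-2, 0-3, 1-2, 1-3, 1-4.
A = np.zeros((5, 5))
for u, v in [(0, 2), (0, 3), (1, 2), (1, 3), (1, 4)]:
    A[u, v] = A[v, u] = 1

S = common_neighbor_similarity(A)
print(S[0, 1])
```

Vertices 0 and 1 have degrees 2 and 3 and share two neighbors, so the measure gives 2/(2 x 3) = 1/3 for this pair.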



III. TESTS OF THE METHOD 

In this section we test our method on a number of dif- 
ferent networks. Our first example is a set of computer- 
generated networks designed to have known similarities 
between vertices. In following sections we also test the 
method against some real-world examples. 



A. Stratified model network 

In many social networks, individuals make connections 
with others preferentially according to some perceived 
similarity, such as age or income. Such networks are said 
to be stratified, and stratified networks present a perfect 
opportunity to apply our similarity measure: ideally we 
would like to see that given only the network structure 
our measure can correctly identify vertices that are sim- 
ilar in age (or whatever the corresponding variable is) 
even when the vertices are not directly connected to one 
another. 

As a first test of our measure, we have created artificial 
stratified networks on a computer. Such networks offer 
a controlled structure for which we believe we know the 
"correct" answers for vertex similarity. 

FIG. 4: (a) A density plot of the similarities of all vertex pairs
not directly connected by an edge in our stratified network
model. The points give the average similarity as a function of
age difference and the line is a least-squares fit to a straight
line. (b) A density plot of the cosine similarity values for the
same network.



In our model networks, each of n = 1000 vertices was
given one of ten integer "ages." Then edges were created
between vertices with probability

$$ P(\Delta t) = p_0 e^{-a \Delta t}, \tag{23} $$

where $\Delta t$ is the difference in ages of the vertices and $p_0$
and a are constants, whose values in our calculations were
chosen to be $p_0 = 0.12$ and a = 2.0. Thus the probability
of "acquaintance" between two individuals drops by
a factor of $e^2$ for every additional year separating their
ages.
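
A network of this kind can be generated in a few lines. The sketch below follows Eq. (23) with the stated parameter values; how ages were assigned to vertices is not specified in the text, so drawing them uniformly at random is an assumption, as are the function name and the fixed random seed.

```python
import numpy as np

def stratified_network(n=1000, n_ages=10, p0=0.12, a=2.0, seed=0):
    """Generate a stratified random graph with edge probability p0 * exp(-a * dt)."""
    rng = np.random.default_rng(seed)
    ages = rng.integers(0, n_ages, size=n)            # assumption: uniform random ages
    dt = np.abs(ages[:, None] - ages[None, :])        # age difference for every pair
    prob = p0 * np.exp(-a * dt)                       # Eq. (23)
    upper = np.triu(rng.random((n, n)) < prob, k=1)   # decide each pair once, no self-loops
    A = (upper | upper.T).astype(int)                 # symmetric adjacency matrix
    return A, ages

A, ages = stratified_network()
mean_degree = A.sum() / len(A)
print(mean_degree)
```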

In order to calculate our similarity measure for this
or any network we need first to choose a value of the
single parameter $\alpha$ appearing in Eq. (16). In the present
calculations we used a value of $\alpha = 0.97$, which, as we will
see, is fairly typical. Since $\alpha$ must be strictly less than
one if Eq. (15) is to converge, $\alpha = 0.97$ is quite close to
the maximum. We discuss in the following section why
values close to the maximum are usually desirable.
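
One way the calculation might be organized is sketched below: the matrix DSD is obtained by iterating Eq. (18) from an initial guess of zero, and the similarity follows on dividing by the degrees as in Eq. (17), with the constant factor $2m\lambda_1$ dropped. The function name, the stopping tolerance, and the small test graph are assumptions for illustration, and the sketch assumes every vertex has nonzero degree.

```python
import numpy as np

def similarity(A, alpha=0.97, tol=1e-12, max_iter=10000):
    """Similarity of Eq. (17), up to the constant 2*m*lambda_1, by iterating Eq. (18)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    k = A.sum(axis=1)                        # vertex degrees (assumed nonzero)
    lam1 = np.linalg.eigvalsh(A).max()       # largest eigenvalue of A
    X = np.zeros((n, n))                     # X stands for DSD; initial guess DSD = 0
    for _ in range(max_iter):
        X_new = (alpha / lam1) * (A @ X) + np.eye(n)   # one step of Eq. (18)
        converged = np.abs(X_new - X).max() < tol
        X = X_new
        if converged:
            break
    return X / np.outer(k, k)                # S = D^{-1} (DSD) D^{-1}

# Hypothetical graph: triangle 0-1-2 with a pendant vertex 3 attached to 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

S = similarity(A, alpha=0.97)

# Cross-check against the closed form of Eq. (17).
lam1 = np.linalg.eigvalsh(A).max()
k = A.sum(axis=1)
S_direct = np.linalg.inv(np.eye(4) - (0.97 / lam1) * A) / np.outer(k, k)
print(np.abs(S - S_direct).max())
```

Because the leading component of the iterate converges like $\alpha^t$ after t steps, values of $\alpha$ close to one need more iterations to reach a tight tolerance than smaller values do.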

Figure 4a shows a density plot of the similarity values
for all vertex pairs in the model network not directly connected
by an edge, on semi-log scales as a function of the
age difference between the vertices. The average similar- 
ity as a function of age difference is also plotted along 
with a fit to the data. We exclude directly connected 
pairs in the figure because it is trivial that such pairs 
will have high similarity and most of the interest in our 
method is in its ability to detect similarity in nontrivial 
cases. 

For comparison, we also show in Fig. 4b a density plot
of the cosine similarity, Eq. (2b), for the same network.
As the plot reveals, the cosine similarity is a much less 
powerful measure. It is only possible for cosine similarity 
to be nonzero if there exists a path of length two between 
the vertices in question. Vertices with an age difference 
of three or more rarely have such a path in this network 
and, as Fig. 4b shows, such vertices therefore nearly all
have a cosine similarity of zero. Thus cosine similarity 
finds only highly similar vertices in this case and entirely 
fails to distinguish between vertices with age differences 
between 3 and 9. Our similarity measure by contrast 
distinguishes these cases comfortably. 



B. Choice of $\alpha$

Our similarity measure, Eq. (16), contains one free parameter
$\alpha$, which controls the relative weight placed on
short and long paths. This parameter lies strictly in the
range $0 < \alpha < 1$, with low values placing most weight
on short paths between vertices and high values placing
weight more equally both on short and long paths.
(Values $\alpha > 1$ would place more weight on long paths
than on short, but for such values the series defining our
similarity does not converge.)

In order to extract quantitative results from our similarity
measure we need to choose a value for $\alpha$. There
is in general no unique value that works perfectly for every
case, but experience suggests some reliable rules of
thumb. Our stratified network model, for instance, provides
a good guide. Consider Fig. 5. In this figure we
have calculated the correlation coefficient of the similarity
values for vertex pairs determined using our method
against the probabilities, Eq. (23), of connections between
the vertices, which we consider to be a fundamental
measure of vertices' a priori similarity. As the figure
shows, the correlation is quite low for low values of $\alpha$, but
becomes strong as $\alpha$ approaches one. Only as $\alpha$ gets very
close to one does the correlation fall off again. This appears
to imply that a value of $\alpha = 0.9$ or greater should
give the best results in this case. Furthermore, it appears
that, for values of $\alpha$ in this range, the precise value
does not matter greatly, all values around the maximum
in the correlation coefficient giving roughly comparable
performance.

This we have found to be a good general rule: values
of $\alpha$ close to the maximum value of 1 perform the
best, with values in the range 0.90 to 0.99 being typical.
Within this range the results are not highly sensitive to
the exact value. We give another example to reinforce
this conclusion below.

FIG. 5: The correlation coefficient r for correlation
between our similarity measure and the probability of connection,
Eq. (23), in our stratified model, for a range of values of
$\alpha$. The values given are averaged over an ensemble of graphs
generated from the model. The maximum value is found to
occur for $\alpha \simeq 0.97$.

The large typical values of $\alpha$ mean that paths of different
lengths are weighted almost equally in our similarity
measure. In other words, it appears that our measure 
works best when long paths are accorded almost as much 
consideration as short ones. This contrasts strongly with 
structural equivalence measures like the Jaccard index 
and the cosine similarity, which are based exclusively on 
short paths — those of length two. We should be unsur- 
prised therefore to find that our method gives substan- 
tially better results than these older measures, as the 
example above shows. 



C. Thesaurus network 

We now consider two applications of our method to 
real-world networks. The first is to a network of words extracted
from a supplemented version of the 1911 US edition
of Roget's Thesaurus [3i| . The thesaurus consists of
a five-level hierarchical categorization of English words. 
For example, the word "paradise" (level five) is cataloged 
under "heaven" (level four), "superhuman beings and regions"
(level three), "religious affections" (level two), and
"words relating to the sentient and moral powers" (level 
one). Here we study the network composed of the 1000 
level-four words, in which two such words are linked if 
one or more of the level-five words cataloged below them 
are common to both. For instance, the level-four words 
"book" and "knowledge" are connected because the en- 
tries for both in the thesaurus contain the level-five terms 
"book learning" and "encyclopedia." 

In Table I we show the words most similar to the words
"alarm," "hell," "mean," and "water," as ranked first by 

word    our measure             cosine similarity

alarm   warning      32.014     omen         0.51640
        danger       25.769     threat       0.47141
        omen         18.806     prediction   0.34816

hell    heaven       63.382     pleasure     0.40825
        pain         28.927     discontent   0.28868
        discontent    7.034     weariness    0.26726

mean    compromise   20.027     gravity      0.23570
        generality   19.811     inferiority  0.22222
        middle       17.084     littleness   0.20101

water   plunge       33.593     dryness      0.44721
        air          25.267     wind         0.31623
        moisture     25.267     ocean        0.31623

TABLE I: The words most similar to "alarm," "hell,"
"mean," and "water," in the word network of the 1911 edition
of Roget's Thesaurus, as quantified by our similarity measure
and by the more rudimentary cosine similarity of Eq. (2).
For our measure we used a value of a = 0.98 for the single
parameter.
FIG. 6: The correlation coefficient for correlation between our
similarity measure and the age difference of all vertex pairs in 
a single network, as a function of a. This plot is typical for 
the school networks studied. 



our similarity measure and second by cosine similarity. 
We used a value of a = 0.98 in this case, on the grounds 
that this value gave the best performance in other test 
cases (see below). 

Since cosine similarity can be regarded as a measure 
of the number of paths of length two between vertices, 
it tends in this example to give high similarity scores 
for words at distance two in the thesaurus — synonyms 
of synonyms, antonyms of synonyms, and so forth. For 
example, cosine similarity ranks "pleasure" as the word 
most similar to "hell," probably because it is closely as- 
sociated with hell's antonym "heaven." By contrast, our 
measure ranks "heaven" itself first, which appears to be 
a more sensible association. Similarly, cosine similarity 
links "water" with "dryness," whereas our measure links
"water" with "plunge." 

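The behavior of cosine similarity described above can be seen from a direct calculation. The sketch below assumes the standard form of the cosine similarity for an unweighted, undirected network, namely the number of common neighbors of i and j divided by the square root of the product of their degrees, which in matrix terms is (A^2)_ij / sqrt(k_i k_j). The function name and the toy star network are illustrative only.

```python
import numpy as np

def cosine_similarity(A):
    """Cosine similarity of all vertex pairs from a 0/1 adjacency matrix.

    Assumes A is symmetric and every vertex has degree at least one.
    """
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)              # vertex degrees
    common = A @ A                 # (A^2)_ij = number of paths of length two
    return common / np.sqrt(np.outer(k, k))

# Toy example: a star with hub 0 and leaves 1, 2, 3.
star = np.array([[0, 1, 1, 1],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0]])
C = cosine_similarity(star)
print(C[1, 2])  # 1.0: two leaves share their only neighbor
print(C[0, 1])  # 0.0: hub and leaf are adjacent but share no neighbor
```

Because only paths of length two contribute, directly connected vertices with no common neighbors receive a score of zero, which is the effect seen for "hell" and "heaven" in the thesaurus network.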


D. Friendship network of high school students 

As a second real-world test of our similarity measure,
we apply it to a set of networks of friendships between 
school children. The network data were collected as part 
of the National Longitudinal Study of Adolescent Health 
(AddHealth) 0, and describe 90118 students at 168 
schools, including their school grade (i.e., year), race, and 
gender, as well as their recent patterns of friendship. It 
is well known that people with similar social traits tend 
to associate with one another, so we expect there to
be a correlation between similarity in terms of personal 
traits and similarity based on network position. This 
gives us another method for checking the efficacy of our 
similarity measure. 

The AddHealth data were gathered through questionnaires
handed out to students at 84 pairs of American
schools, a school pair typically consisting of one
junior high school (grades 7 and 8, ages 12-14) and 
one high school (grades 9-12, ages 14-18). Here we 
look at a composite of the school pairs with vertices 
from all six grades. Among other things, the question- 
naires circulated during the study asked respondents to 
"List your closest (male/female) friends. List your best 
(male/female) friend first, then your next best friend, and 
so on. (Girls/Boys) may include (boys/girls) who are 
friends and (boy/girl) friends." For each of the friends 
listed the student was asked to state which of five listed 
activities they had participated in recently, such as "you 
spent time with (him/her) last weekend". From these 
answers a weight w(i,j) is assigned to every ordered pair 
of students such that w(i,j) is 0 if i has not listed j
as a friend, and 1 plus the number of activities conducted
otherwise. From these weights we construct an unweighted,
undirected friendship network by adding a link between 
vertices i and j if w(i, j) and w(j, i) are both greater than 
or equal to a specified threshold value W. As it turns out, 
our conclusions are not very sensitive to the choice of W; 
the results described here use W = 2. 
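
The thresholding rule just described can be written compactly. In this sketch the array w holds the weights w(i,j), with w[i, j] = 0 when i did not list j as a friend; the function name friendship_network and the toy weights are hypothetical and serve only to illustrate the rule.

```python
import numpy as np

def friendship_network(w, W=2):
    """Unweighted, undirected network from directed nomination weights.

    Vertices i and j are linked only if both w(i,j) >= W and w(j,i) >= W.
    """
    w = np.asarray(w)
    mutual = (w >= W) & (w.T >= W)    # threshold must be met in both directions
    np.fill_diagonal(mutual, False)   # no self-links
    return mutual.astype(int)

# Toy weights for three students.
w = np.array([[0, 3, 1],
              [2, 0, 4],
              [1, 0, 0]])
A = friendship_network(w, W=2)
print(A)  # only students 0 and 1 are linked
```

Here student 1 lists student 2 with weight 4, but the nomination is not reciprocated, so no link is added.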

The networks so derived are not necessarily connected; 
they may, and often do, consist of more than one com- 
ponent for each school studied. To simplify matters we 
here consider only the largest component of each network.
The largest component in some of the networks
is quite small, however, so to avoid finite size effects we 
have focused on networks of more than 1000 students. 
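
Restricting a network to its largest component can be sketched with a breadth-first search. The function name largest_component and the dense-matrix representation are illustrative assumptions, not a description of the code used to obtain the results reported here.

```python
from collections import deque

import numpy as np

def largest_component(A):
    """Return (vertex indices, adjacency submatrix) of the largest component.

    Assumes A is a symmetric 0/1 adjacency matrix with at least one vertex.
    """
    A = np.asarray(A)
    n = A.shape[0]
    seen = np.zeros(n, dtype=bool)
    best = []
    for s in range(n):
        if seen[s]:
            continue
        # Breadth-first search from an unvisited vertex s.
        seen[s] = True
        comp, queue = [s], deque([s])
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(A[u]):
                if not seen[v]:
                    seen[v] = True
                    comp.append(int(v))
                    queue.append(int(v))
        if len(comp) > len(best):
            best = comp
    idx = np.array(sorted(best))
    return idx, A[np.ix_(idx, idx)]

# Toy example: a path on vertices 0, 1, 2 and a separate pair 3, 4.
A = np.zeros((5, 5), dtype=int)
for i, j in [(0, 1), (1, 2), (3, 4)]:
    A[i, j] = A[j, i] = 1
idx, sub = largest_component(A)
print(idx)  # [0 1 2]
```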

We first test our similarity measure using the method 
we used for the stratified network of Section III A: we
determine the linear correlation coefficient between age 
difference (measured as difference in grade) and our net- 
work similarity measure, for all vertex pairs in a network. 
We have calculated this correlation coefficient for a range 
of values of a, the free parameter in our measure, and for 
a selection of different networks. The results for one particular
network are shown in Fig. 6. In this case the
correlation coefficient is maximized for a ~ 0.99, which 
is again close to the maximum possible value of 1. For 
other networks we find maxima in the range from 0.96 to 
0.99, which is in accord with the results of Section III B.
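
The correlation test can be sketched as follows, assuming a precomputed similarity matrix S and a vector of grades. The function name pair_correlation is hypothetical, and the sign convention is an assumption of this sketch: with the absolute grade difference as the second variable, a useful similarity measure gives a negative coefficient, since similar vertices tend to be close in grade.

```python
import numpy as np

def pair_correlation(S, grade):
    """Pearson correlation between similarity and grade difference.

    The correlation is taken over all unordered vertex pairs i < j.
    """
    S = np.asarray(S, dtype=float)
    grade = np.asarray(grade, dtype=float)
    i, j = np.triu_indices(len(grade), k=1)   # all pairs with i < j
    diff = np.abs(grade[i] - grade[j])        # grade difference of each pair
    return np.corrcoef(S[i, j], diff)[0, 1]

# Toy check: similarity that falls linearly with grade difference.
grade = np.array([7, 7, 8, 9])
S = -np.abs(grade[:, None] - grade[None, :]).astype(float)
print(pair_correlation(S, grade))  # close to -1
```

Repeating this calculation for a range of values of a gives a curve of the kind used to choose the parameter.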

These correlations between age difference and network 
similarity appear to indicate that our similarity measure 
is able to detect some aspects of the social structure of 
these networks. To investigate this further, we have taken
the optimal values of a from the correlation coefficients 
and used them to calculate the average similarity of ver- 
tex pairs that have a known common characteristic, ei- 
ther grade or race, comparing that average with the aver- 
age similarity for vertex pairs that differ with respect to 
the same characteristic. The results are given in Table II.

For school A the average similarity for pairs of students 
in the same grade is a factor of eight greater than that for 
pairs in different grades — an impressive difference. It is 
possible, however, that this difference could result purely 
from the contribution to the similarity from vertex pairs 
that are directly connected by an edge. It would come as 
no surprise that such pairs tend to be in the same grade. 
To guard against this, we give in the fourth column of 
Table II results for calculations in which all directly connected
vertex pairs were removed. Even with these pairs
removed we see that same-grade vertex pairs are on av- 
erage significantly more similar than pairs from different 
grades. 
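
The ratios discussed here can be computed along the following lines. This is a sketch under stated assumptions: S is a precomputed similarity matrix, labels holds one characteristic (grade or race) per vertex, and passing the adjacency matrix A drops directly connected pairs. The function name similarity_ratio is hypothetical.

```python
import numpy as np

def similarity_ratio(S, labels, A=None):
    """Ratio of mean similarity for same-label pairs to different-label pairs.

    If A is given, vertex pairs joined directly by an edge are excluded.
    Assumes both groups of pairs are non-empty after any exclusion.
    """
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(labels), k=1)   # all pairs with i < j
    keep = np.ones(i.size, dtype=bool)
    if A is not None:
        keep = np.asarray(A)[i, j] == 0        # drop directly connected pairs
    same = labels[i] == labels[j]
    s = S[i, j]
    return s[keep & same].mean() / s[keep & ~same].mean()

# Toy example with two groups of two vertices each.
labels = np.array([0, 0, 1, 1])
S = np.array([[0, 4, 1, 1],
              [4, 0, 1, 1],
              [1, 1, 0, 2],
              [1, 1, 2, 0]], dtype=float)
A = np.zeros((4, 4), dtype=int)
A[0, 1] = A[1, 0] = 1
print(similarity_ratio(S, labels))      # 3.0
print(similarity_ratio(S, labels, A))   # 2.0 once the edge (0, 1) is excluded
```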

We have made similar calculations with respect to the 
race of students. Students in school A did not appear to 
have any significant division along racial lines (columns 
five and six of Table II), but this school was almost entirely
composed of students of a single race anyway, so
this result is not very surprising; it seems likely that the 
numbers were just too small to show a significant effect. 
School B was similar. Schools C and D, however, show 
a marked contrast. In school C, the average similarity 
for students of the same race is a factor of five greater 
than the average similarity for students of different races. 
School C had a population split 2:1 between two racial 
groups, in marked contrast with schools A and B. School 
D similarly appears to be divided by race, although a 
little less strongly. In this case there is a three-way split 
within the population between different racial groups. 
Possibly this more even split with no majority group was 
a factor in the formation of friendships between students 
from different groups. 



IV. CONCLUSIONS 

In this paper we have proposed a measure of structural 
similarity for pairs of vertices in networks. The method 
is fundamentally iterative, with the similarity of a vertex 
pair being given in terms of the similarity of the vertices' 
neighbors. Alternatively, our measure can be viewed as 
a weighted count of the number of paths of all lengths 
between the vertices in question. The weights appearing 









                   similarity ratios
school   n      SG:DG   SG:DG*   SR:DR   SR:DR*
A        1090   8.0     6.1      1.1     1.1
B        1302   6.2     4.4      2.6     2.6
C        1996   2.2     1.9      5.0     5.0
D        1530   3.3     2.6      4.0     3.6


TABLE II: Network size n and ratios of average similarity 
values for school networks in the AddHealth data set. The 
column labeled SG:DG gives the ratio of average similarity 
for students in the same grade (SG) to average similarity for 
students in different grades (DG). The column labeled SR:DR 
gives the ratio of average similarity for students of the same 
race (SR) to average similarity for students of different races 
(DR). Columns marked with asterisks (*) give values of the 
same ratios but omitting vertex pairs connected directly by 
an edge. 



in this count are asymptotically equal to the expected 
numbers of network paths between the vertices, which 
we express in terms of the leading eigenvalue of the ad- 
jacency matrix of the network and the degrees of the 
vertices of interest. The resulting expression for our similarity
measure is given in Eq. (17).

We have tested our measure against computer- 
generated and real-world networks, with promising re- 
sults. In tests on computer-generated networks the mea- 
sure is particularly good at discerning similarity between 
vertices connected by relatively long paths, an area in 
which more traditional similarity measures such as co- 
sine similarity perform poorly. In tests on real-world net- 
works the method was able to extract sensible synonyms 
to words from a network representing the structure of 
Roget's Thesaurus, and showed strong correlations with
similarity of age and race in a number of networks of 
friendship among school children. Taken together, these 
results seem to indicate that the measure is capable of ex- 
tracting useful information about vertex similarity based 
on network topology. 

The strength of similarity measures such as ours is 
their generality — in any network where the function or 
role of a vertex is related in some way to its structural sur- 
roundings, structural similarity measures can be used to 
find vertices with similar functions. For instance, similar- 
ity measures can be used to divide vertices into functional 
categories [2(],|3(l,|32l or for functional prediction in cases
where the functionality of vertices is partly known ahead
of time [38]. We believe that the application of similarity
measures to problems such as these will prove a fruitful 
topic for future work. 